Supervised Learning Detection of Sixty Non-transiting Hot Jupiter Candidates

Sarah Millholland; Gregory Laughlin

doi:10.3847/1538-3881/aa7a0f

1. Introduction

While the Kepler prime mission is best known for the detection of thousands of transiting extrasolar planets, notably including some that are Earth-sized and smaller (e.g., Batalha et al. 2013), its long time-baseline and high photometric precision allowed for the detection of other small-amplitude signals, such as out-of-transit photometric variations of short-period planets (Borucki et al. 2009). Long before Kepler was launched, the existence and detectability of these optical reflected light phase curves were predicted by Seager et al. (2000) and Jenkins & Doyle (2003).

Optical phase curves, along with their infrared counterparts, provide a spatially integrated time-dependent view of the planet's atmosphere. They yield information about the planet's atmospheric composition (Rowe et al. 2006; Demory et al. 2013), day-night temperature contrasts and heat redistribution (Knutson et al. 2007; Showman et al. 2009; Cowan & Agol 2011), cloud existence and reflectivity (Demory et al. 2013; Shporer & Hu 2015), atmospheric weather variability (Armstrong et al. 2016), and magnetic field strength (Rogers 2017). A recent review of phase curve methodology and scientific findings is provided by Shporer (2017).

A planet's optical phase curve is the superposition of four independent components: reflected star light, thermal emission, ellipsoidal variations, and Doppler beaming/boosting (Faigler & Mazeh 2011). Ellipsoidal variations are brightness changes resulting from tidal deformation to the star (Morris 1985), while Doppler beaming is a relativistic effect that causes an increase or decrease in light as the star approaches or recedes from the observer (Hills & Dale 1974; Rybicki & Lightman 1979). Ellipsoidal variations and Doppler beaming are both sensitive to the planet's mass, which thereby permits photometric mass constraints (e.g., Lillo-Box et al. 2016). For an average hot Jupiter ( $P\sim 3$ days, ${R}_{p}\sim 1.3\ {R}_{\mathrm{Jup}}$ , ${M}_{p}\sim {M}_{\mathrm{Jup}}$ ) orbiting a Sun-like star, the reflected light component is typically larger than the other components by about an order of magnitude.

There have been numerous detections of optical phase curves of transiting Kepler planets (e.g., Welsh et al. 2010; Shporer et al. 2011; Barclay et al. 2012; Mazeh et al. 2012; Lillo-Box et al. 2014). In addition, comprehensive searches for Kepler planetary phase curves and secondary eclipses have also been published (Coughlin & López-Morales 2012; Esteves et al. 2013; Angerhausen et al. 2015; Esteves et al. 2015; Lillo-Box et al. 2016). These systematic searches have established several interesting findings, including generally low (but occasionally quite large) hot Jupiter albedos (Demory et al. 2011) and repeated detections of shifts in the maximum of the phase curve away from the substellar point (Faigler et al. 2013; Faigler & Mazeh 2015; Hu et al. 2015; Shporer & Hu 2015).

Although it is substantial, the sample size (∼15) of Kepler transiting planets with detectable photometric variations is just short of the count necessary to adequately understand these atmospheric properties in a statistical or populational sense. It is therefore worthwhile to increase the number of planets with well-characterized optical phase curves by expanding the search to non-transiting planets and using the phase curve signal itself as a detection mechanism. Given their large expected radial velocity half-amplitudes, candidates identified in this manner can subsequently be confirmed with Doppler velocity measurements.

The prospects of detecting non-transiting planets from their photometric variations alone has been frequently discussed in the literature. The BEaming, Ellipsoidal, and Reflection/heating (BEER) model (Faigler & Mazeh 2011) was developed in part as a technique to discover non-transiting planets. To date, the BEER model has been very successful in the detection of non-transiting stellar binary companions (Faigler et al. 2012) and in the characterization of transiting planets with detectable phase curve variations (Faigler et al. 2013; Faigler & Mazeh 2015).

Placek et al. (2014) proposed using Bayesian model selection to test a light curve for the presence of significant reflected light, thermal emission, ellipsoidal variation, and/or Doppler beaming phase curve components. Millholland et al. (2016) considered the prospects of discovering non-transiting hot Jupiters in Kepler systems containing known transiting planets with the coupled detection of an optical phase curve and astrometric (stellar wobble-induced) transit-timing variations.

In all of these techniques, the primary challenge is that the ellipsoidal and Doppler beaming components are typically significantly smaller than the reflected light component for Jupiter-sized planets. The photometric signature is therefore mostly sinusoidal and difficult to distinguish from stellar or instrumental variability. These prior studies, however, have uniformly considered the phase-folded and time-averaged phase curve signal. When the signal is folded and binned, valuable time-dependent information is lost. This normally discarded information can be helpful in distinguishing planetary signals from stellar ones.

In this paper, we present a systematic procedure to detect non-transiting, short-period giant planets from their phase curves by exploiting the phase curves' time-dependent properties. The algorithm is rooted in two key ideas:

1.
A planetary phase curve is more temporally consistent in its amplitude and phase than most other stellar or instrumental light curve variability.
2.
The properties of phase curves of known transiting Kepler planets and synthetically generated data sets can be used as a training set on which novel light curves can be compared and classified using supervised learning.

This paper is organized as follows. In Section 2 we outline the pipeline we developed for detecting candidate phase curves in Kepler light curves. In Section 3 we inject synthetic phase curves into Kepler light curves and demonstrate the pipeline's efficiency in recovering them. In Section 4 we train a logistic regression algorithm to distinguish the properties of planetary phase curves from other signals. We test its classification capabilities on synthetic data sets and the set of Kepler transiting hot Jupiters. In Section 5 we employ the supervised learning algorithm to search for phase curves around Kepler FGK stars without known planets or planet candidates. In Section 6 we establish a catalog of candidate planets with phase curves and examine trends in the planet albedos and phase curve maximum offsets. We discuss additional properties of the candidates in Section 7 and conclude in Section 8.

2. An Automated Detection Pipeline

To detect non-transiting hot Jupiters in the Kepler field, we first construct an algorithm for identifying the single best candidate phase curve within a given light curve. The process consists of three distinct steps:

1.
Search for the periods of potentially significant approximately sinusoidal signals in the data by calculating Lomb–Scargle periodograms of initially detrended light curves.
2.
At each candidate period, find the best-fit parameters of a phase curve by simultaneously fitting for the phase curve and stellar components.
3.
Quantify the autocorrelations in the residuals of the fits.

The light curves we consider here are the pre-search data conditioning (PDC) photometry (Smith et al. 2012; Stumpe et al. 2012, 2014) provided by the Kepler Science Center and publicly available at Mikulski Archive for Space Telescopes (MAST).¹ The PDC light curves have undergone a removal of systematic errors common to all light curves.

2.1. Step 1—Identifying Candidate Periods

The first goal is to identify the periods of significant sinusoidal signals in the light curve. To do this, we must first damp the effects of stellar variability or residual instrumental systematics that could contaminate a periodogram. We model our detrending algorithm on the kepflatten routine in the PyKE software package² (Still & Barclay 2012).

We detrend the light curve by fitting it with the mean of polynomials that span a sliding window. In detail, we first split the light curve into sections of length s. Then, over a window of size w = Ns (with N an integer), we fit a third degree polynomial. We slide the window through all time sections, stepping in units of s. Within each section, we then calculate the mean of the N polynomials that were fit in that section, resulting in a smooth fitted curve. We then detrend the light curve data by dividing by the fit. At this stage, we calculate outliers that are deviant by more than $7\sigma$ , remove them, and iterate the process.

We perform two detrending calculations with different window sizes in an attempt to account for stellar variability at a variety of timescales. A step size of s = 2 days is used in each calculation, but the short-timescale detrending uses w = 6 days, while the long-timescale detrending uses w = 12 days. We aim to detect phase curves with orbital periods up to 5 days (although in reality, most detections will be shorter than 3 days). It is therefore critical that w be large enough to avoid fitting out the phase curve signals we wish to detect. This explains our choice of w = 6 for the short-timescale detrending.

On each of the two detrended curves associated with a given light curve, we generate a Lomb–Scargle periodogram (Lomb 1976; Scargle 1982) within a 1–5 day range. Hence, we assume that we will only be capable of detecting phase curves within this period range. Although there are several known hot Jupiters with sub-day periods (e.g., WASP-18b, Hellier et al. 2009), we found that extending the periodogram much below 1 day leads to significantly more noise and poorer overall results.

We also generate a "local significance" periodogram. At each period in the spectrum, we calculate the number of standard deviations of the periodogram power over the median power within a centered 0.5-day window. The purpose of this additional step is that the peak in a periodogram associated with a true phase curve is sometimes highly significant with respect to nearby powers, but not with respect to the entire 1–5 day range. As an illustrative example, Figure 1 displays the Lomb–Scargle periodogram and local significance periodogram for the transiting hot Jupiter Kepler-686b with a radius $R\sim 1.08\ {R}_{\mathrm{Jup}}$ and a period of $P\sim 1.595$ day. We removed the transits and secondary eclipses in the light curve before the computation.

**Figure 1.** Lomb–Scargle periodogram and local significance periodogram of the light curve of the transiting hot Jupiter Kepler-686b (KIC 3935914). Transits and secondary eclipses were removed before computation. In this example, the peak associated with the true period is not the highest in the Lomb–Scargle periodogram, but it is the highest in the local significance periodogram.
Download figure:
Standard image High-resolution image

We extract the highest peak in each of the four periodograms—one Lomb–Scargle and one local significance periodogram for each of the two detrending calculations. We find the unique periods (defined as separate by more than 0.01 days) among these, yielding a collection of 1–4 peak periods. These period(s) constitute candidate periods for a planetary phase curve. The candidate signals are analyzed in more detail in the next section.

2.2. Step 2—Phase Curve Fitting

The initial detrending calculations described above are sufficient for identifying the periods of significant sinusoidal signals in the data. To robustly characterize the candidate phase curves, however, it is necessary to model the planetary signal and stellar variability simultaneously, rather than attempting to remove the stellar signal a priori. To this end, we use the following steps to find the best-fit sinusoidal signal for each of the candidate period(s) identified.

The best fit is found by chi-square minimization of a multiplicative, two-component (phase curve and stellar variability) light curve model. The phase curve component is modeled by a fixed-period sinusoid with variable phase and amplitude. The stellar variability component, that is, the component remaining after the sinusoid is divided out, is fit using the sliding polynomial technique described in Section 2.1. Here we use three degree polynomials, a step size s = 2 days, and a window size w = 6 days. The best fit is characterized by the amplitude and phase of the sinusoid that minimizes the residuals of the joint fit. For the purposes of later comparison, we also perform a one-component fit without the sinusoid. In other words, this just contains the sliding polynomial fit. Examples of the one-component (star only) and two-component (star and phase curve) fits are shown for Kepler-13b in Figure 2. Kepler-13b is a transiting hot Jupiter with a radius $R\sim 1.51\ {R}_{\mathrm{Jup}}$ and a period of $P\sim 1.764\ \mathrm{day}$ and has a large-amplitude phase curve.

**Figure 2.** One- and two-component fits in a section of the light curve of Kepler-13b (KIC 9941662). The transits and secondary eclipses have been removed from the data. The black points are the PDC flux measurements. The green curve is the two-component phase curve and stellar variability fit, while the blue curve is the stellar component alone. The ratio of the Durbin–Watson statistics of the residuals of the two fits is displayed in the upper left corner.
Download figure:
Standard image High-resolution image

We note that techniques such as Gaussian process (GP) regression or autoregressive integrated moving-average (ARIMA) modeling might be better suited to capturing the stellar variability component of the light curve (e.g., Haywood et al. 2014; Barclay et al. 2015; Caceres et al. 2015). We did successfully apply a GP regression to take the place of the stellar variability polynomial fit, but we found that any improvements in the fit did not outweigh the larger computational costs.

2.3. Step 3—Quantifying Residual Autocorrelations

Given the best-fit sinusoidal signal in the data for each of the 1–4 candidate period(s), our goal is to diagnose each signal's phase and amplitude coherence. This information will be used to select the best of the 1–4 candidate signals and to recognize the time-dependent properties of true phase curves later in Section 4.

The phase and amplitude coherence is best examined with the autocorrelation of the fit residuals. If the signal experiences large phase and amplitude variations, a sinusoidal fit will be inappropriate, and the residuals will be significantly non-Gaussian. On the other hand, the signal of a true phase curve will have less strongly correlated residuals. We examine two diagnostics of the residual autocorrelation: the Durbin–Watson statistic (Durbin & Watson 1950), and the Ljung–Box statistic (Ljung & Box 1978).

The Durbin–Watson test statistic for residuals $\{{r}_{i}\}$ is given by

$\begin{eqnarray}&&d=\displaystyle \frac{{\displaystyle \sum }_{i=2}^{N}{({r}_{i}-{r}_{i-1})}^{2}}{{\displaystyle \sum }_{i=1}^{N}{{r}_{i}}^{2}}.\end{eqnarray} \tag{ 1 }$

In a Durbin–Watson test, d is compared to critical values at selected significance levels. The test is used to investigate the null hypothesis that the residuals exhibit zero autocorrelation against the alternative hypothesis that the residuals are positively or negatively autocorrelated. Uncorrelated residuals have $d\sim 2$ , while positively correlated residuals have $d\ll 2$ .

The Ljung–Box statistic is given by

$\begin{eqnarray}&&{Q}_{h}=N(N+2)\displaystyle \sum _{j=1}^{h}\displaystyle \frac{{\hat{{\rho }_{j}}}^{2}}{N-j},\end{eqnarray} \tag{ 2 }$

where $\hat{{\rho }_{j}}$ is the autocorrelation of the residuals at lag j, N is the number of data points, and h is the number of lags being tested. If the residuals are independently distributed, Q_h follows a chi-squared distribution with h degrees of freedom, ${Q}_{h}\sim {{\chi }_{h}}^{2}$ . The residuals show evidence for autocorrelation at significance level α if ${Q}_{h}\gt {\chi }_{1-\alpha ,h}^{2}$ . Rather than calculate the statistic at one lag, we calculate it at a variety of lags and plot Q_h versus h. For example, in Figure 3 we display the Q_h versus h curves for the residuals of the one- and two-component fits of Kepler-13b. (A section of these fits were shown in Figure 2.) Q_h exhibits a nearly linear dependence with the lag, with different slopes and y-intercepts between the two fits. Since the "star only" fit has more autocorrelated residuals, the slope and y-intercept of its Q_h curve are larger.

**Figure 3.** Ljung–Box statistic, Q_h, as a function of lag, h, for the residuals of the one-component and two-component light curve fits of Kepler-13b (KIC 9941662). A portion of the light curve fits was shown in Figure 2. The black dashed lines are linear fits. The ratios of the slopes and y-intercepts for the two lines are shown in the lower right corner.
Download figure:
Standard image High-resolution image

Given that the two-component (phase curve and stellar variability) fit described in Section 2.2 will rarely perfectly model a Kepler light curve, there will always be some serial autocorrelation. To make the Durbin–Watson and Ljung–Box statistics useful, we therefore do not perform the statistical tests directly. Instead, we use the ratio of the test statistics in the two-component fit to their values in the one-component fit. For the Durbin–Watson statistic and Ljung–Box slope and y-intercept, these ratios are denoted ${d}_{2}/{d}_{1}$ , ${m}_{2}/{m}_{1}$ , and ${b}_{2}/{b}_{1}$ , respectively. The ratios for Kepler-13b are displayed in Figures 2 and 3.

As described in Section 2.2, the one-component fit is the "star only" case, that is, the sliding polynomial fit to the data without a sinusoidal component. The ratio of the test statistics in the star and phase curve versus star only cases thus probes the degree to which the serial autocorrelation is improved or degraded by the inclusion of a sinusoidal signal in the data. Phase curves with high signal-to-noise ratios will exhibit a significant reduction in the autocorrelation when the two-component fit is performed, with ${d}_{2}/{d}_{1}\gg 1$ , ${m}_{2}/{m}_{1}\ll 1$ , and ${b}_{2}/{b}_{1}\ll 1$ . Similarly, we also examine the degree to which the two-component fit is favored using the ratio of the ${\chi }^{2}$ , denoted by ${{\chi }^{2}}_{2}/{{\chi }^{2}}_{1}$ . Strongly favored phase curve fits will have ${{\chi }^{2}}_{2}/{{\chi }^{2}}_{1}\ll 1$ .

Returning to the description of the pipeline, at this stage, the autocorrelation diagnostics have been calculated for the best-fit sinusoidal signals associated with each of the 1–4 peak periods. To choose a single best signal, we use the following minimization:

$\begin{eqnarray}&&\mathop{\min }\limits_{P\in \{{P}_{1},\ldots ,{P}_{4}\}}\left\{\displaystyle \frac{{{\chi }^{2}}_{2}}{{{\chi }^{2}}_{1}}-\displaystyle \frac{{d}_{2}}{{d}_{1}}+\displaystyle \frac{{m}_{2}}{{m}_{1}}\right\}.\end{eqnarray} \tag{ 3 }$

We thus select the signal with the best fit and lowest autocorrelation. This step concludes the procedure for identifying candidate phase curves in Kepler light curves.

3. Synthetic Phase Curve Injection and Recovery

In Section 2 we built a pipeline for the detection of a candidate phase curve signal in a Kepler light curve. We now construct and inject synthetic phase curves into Kepler light curves and demonstrate the pipeline's recovery efficiency. These synthetic phase curves will also serve as the training set for the logistic regression in Section 4.

3.1. Phase Curve Model

Optical phase curve models have been constructed by various authors, including Faigler & Mazeh (2011), Barclay et al. (2012), Placek et al. (2014), Angerhausen et al. (2015), and Esteves et al. (2013, 2015). Here we adopt the notation and model of Esteves et al. (2015). The light curve of a non-transiting planet is composed of three components: the flux coming from the planet (both reflection and thermal), F_p, and the Doppler beaming and ellipsoidal variation components from the star, F_d and F_e. The composite light curve is given by

$\begin{eqnarray}&&F={F}_{p}(\phi +\theta )+{F}_{d}(\phi )+{F}_{e}(\phi ).\end{eqnarray} \tag{ 4 }$

The curve is a function of the phase, ϕ, which ranges from 0 to 1 and is given by

$\begin{eqnarray}&&\phi =\displaystyle \frac{t-{T}_{\mathrm{mid}}}{P},\end{eqnarray} \tag{ 5 }$

where ${T}_{\mathrm{mid}}$ is the time of inferior conjunction (or mid-transit for a transiting planet).

The planetary brightness is modeled as a Lambert sphere, described by

$\begin{eqnarray}&&{F}_{p}={A}_{p}\displaystyle \frac{\sin z+(\pi -z)\cos z}{\pi },\end{eqnarray} \tag{ 6 }$

where $\cos z=-\sin i\cos (2\pi [\phi +\theta ])$ . The planetary brightness is approximately

$\begin{eqnarray}&&{A}_{p}\approx {F}_{\mathrm{ecl}}={A}_{g}{\left(\displaystyle \frac{{R}_{p}}{a}\right)}^{2},\end{eqnarray} \tag{ 7 }$

and θ is a phase offset of the peak brightness from the substellar point.

The Doppler beaming flux is given by

$\begin{eqnarray}&&{F}_{d}={\left(\displaystyle \frac{2\pi G}{P}\right)}^{1/3}\displaystyle \frac{{M}_{p}{\alpha }_{d}\sin i}{c\ {{M}_{\star }}^{2/3}}\left(\displaystyle \frac{1+e\cos \omega }{\sqrt{1-{e}^{2}}}\right)\sin (2\pi \phi ).\end{eqnarray} \tag{ 8 }$

The constant ${\alpha }_{d}$ is the photon-weighted, bandpass-integrated beaming coefficient.

The ellipsoidal variation component is, using the first three cosine harmonics,

$\begin{eqnarray}{F}_{e} & = & -\displaystyle \frac{{M}_{p}{\alpha }_{2}{\sin }^{2}i}{{M}_{\star }}\left[{\left(\displaystyle \frac{a}{{R}_{\star }}\right)}^{-3}\cos (4\pi \phi )\ \right.\\ & & +\left.{\left(\displaystyle \frac{a}{{R}_{\star }}\right)}^{-4}{f}_{1}\cos (2\pi \phi )+{\left(\displaystyle \frac{a}{{R}_{\star }}\right)}^{-4}{f}_{2}\cos (6\pi \phi )\right],\end{eqnarray} \tag{ 9 }$

where

$\begin{eqnarray}{f}_{1} & = & 3{\alpha }_{1}\displaystyle \frac{5{\sin }^{2}i-4}{\sin i}\\ {f}_{2} & = & 5{\alpha }_{1}\sin i,\end{eqnarray} \tag{ 10 }$

and ${\alpha }_{1}$ and ${\alpha }_{2}$ are functions of the limb-darkening and gravity-darkening parameters (Esteves et al. 2013).

Although ${\alpha }_{d}$ is a function of the stellar spectrum and the Kepler transmission function, and ${\alpha }_{1}$ and ${\alpha }_{2}$ are complicated functions of limb-darkening parameters, they all exhibit nearly linear dependence with T_eff within the range of interest ( $4000\ {\rm{K}}\lt {T}_{\mathrm{eff}}\lt 7000\ {\rm{K}}$ ). In Figure 4 we plot the Doppler and ellipsoidal coefficients versus T_eff for 12 host stars of planets with phase curves from Esteves et al. (2015). We also show best-fit lines. There is a clear linear relationship for each α coefficient with ${T}_{\mathrm{eff}}$ . The trends agree well with an unpictured additional star with ${T}_{\mathrm{eff}}=4550\ {\rm{K}}$ .

**Figure 4.** Doppler beaming constant, ${\alpha }_{d}$ , and ellipsoidal variation constants, ${\alpha }_{1}$ and ${\alpha }_{2}$ , vs. T_eff for 12 stars in Esteves et al. (2015). The equations in the upper right corners correspond to the best-fit lines. The quality of the fits illustrates that parametrization of these coefficients as simple functions of T_eff should be sufficient for our modeling purposes.
Download figure:
Standard image High-resolution image

Given the high quality of the fits, we simplify our phase curve model by letting ${T}_{\mathrm{eff}}$ be a parameter, with the α coefficients calculated using the best-fit lines pictured in Figure 4. Considering Equations (6), (8), and (9), 9 parameters are required to model a phase curve: $P,{M}_{p},{R}_{p},i,{A}_{g},\theta ,{M}_{\star },{R}_{\star },\,\mathrm{and}\,{T}_{\mathrm{eff}}$ . We assume that $e\approx 0$ since planets with detectable phase curves are typically hot Jupiters with low eccentricities.

3.2. Light Curve Injection

In order to inject simulated phase curves into Kepler data sets, the most thorough procedure would be to insert the signal into pixel-level data, as has been done in measuring the transit signal recovery of the Kepler pipeline (Christiansen et al. 2013, 2015, 2016). However, Christiansen et al. (2013) showed an extremely high fidelity in the preservation of individual injected transit signals from the pixel-level data through image calibration, aperture photometry, and PDC systematics removal. Given this precedent, it should be appropriate to inject synthetic phase curves into the PDC light curve data, rather than the pixel-level data.

We picked a random sample of 10,000 Kepler FGK main-sequence stars without confirmed planets or Kepler Objects of Interest (KOIs) and extracted their PDC light curves. Here we define FGK dwarfs with the simple cuts used by Christiansen et al. (2016): $4000\,{\rm{K}}\lt {T}_{\mathrm{eff}}\lt 7000\ {\rm{K}}$ and $\mathrm{log}\,g\gt 4.0$ . For each target, we generated a single simulated phase curve using the nine physical quantities listed in Table 1. Here ${ \mathcal U }$ denotes a uniform distribution. The stellar parameters $\ {M}_{\star },{R}_{\star }$ , and T_eff are absent from this list. Instead of drawing these parameters randomly, we use the values of the target star. It is important to note that the parameter space spanned by the distributions in Table 1 need not reflect the true distribution of planets; instead, they must encompass all plausible phase curve amplitudes and morphologies capable of being observed.

Table 1. Parameter Distributions of Synthetic Phase Curve Injections

Parameter	Distribution
Period, P [days]	$P\sim { \mathcal U }[1,5]$
Planet mass, M_P $[{M}_{\mathrm{Jup}}]$	${M}_{P}\sim { \mathcal U }[0.25,2.5]$
Planet radius, R_P $[{R}_{\mathrm{Jup}}]$	${R}_{P}\sim { \mathcal U }[0.7,1.8]$
Inclination, i $[^\circ ]$	$i\sim { \mathcal U }[45,90]$
Mean geometric albedo, $\langle {A}_{g}\rangle$	$\langle {A}_{g}\rangle \sim { \mathcal U }[0.03,0.35]$
Albedo covariance amplitude, ${{h}_{A}}^{2}$	${{h}_{A}}^{2}\sim { \mathcal U }[0,0.015]$
Mean peak offset, $\langle \theta \rangle$	$\langle \theta \rangle \sim { \mathcal U }[-0.1,0.05]$
θ covariance amplitude, ${{h}_{\theta }}^{2}$	${{h}_{\theta }}^{2}\sim { \mathcal U }[0,0.0025/{\langle \theta \rangle }^{2}]$
Coherence timescale, τ [days]	$\tau \sim { \mathcal U }[10,50]$

Download table as: ASCII Typeset image

Table 2. A Catalog of 60 High-probability Non-transiting Hot Jupiter Candidates

KIC	K_p	${T}_{\mathrm{eff}}\ ({\rm{K}})$	${R}_{\star }\ ({R}_{\odot })$	${M}_{\star }\ ({M}_{\odot })$	Fe/H	P (days)	${A}_{\mathrm{refl}}$	${A}_{\mathrm{ellip}}$	$K\sin i\ ({\rm{m}}\,{{\rm{s}}}^{-1})$	Prob
2706947	13.7	6771	1.29	1.15	−0.44	2.5762 ± 0.0018	48	3.9	150 ± 144	0.981
2708787	15.1	4243	0.63	0.62	−0.12	1.8296 ± 0.0008	49.6	6	326 ± 188	0.987
3217078	15.7	4780	0.77	0.75	0.14	1.853 ± 0.0008	50.7	9.2	389 ± 200	0.984
3347307	14.6	5955	0.91	1	−0.16	1.8329 ± 0.0008	31.5	5.7	251 ± 248	0.992
3539728	15.3	6075	0.97	1.05	−0.12	1.6002 ± 0.0006	42.9	8.1	257 ± 268	0.982
3964318	14.1	4544	0.64	0.71	−0.08	1.7412 ± 0.0007	28	2.3	133 ± 108	0.977
4753174	14.3	6592	1.19	1.21	−0.12	1.685 ± 0.0008	47.4	2.2	55 ± 63	0.981
4914087	15	5505	0.88	0.83	−0.18	2.2909 ± 0.0017	22.8	9.8	496 ± 372	0.984
5001685	14	5713	1.01	1.01	0.18	2.2906 ± 0.0014	15.1	4.8	221 ± 108	0.991
5354490	14.7	5638	0.97	1	0.21	2.6102 ± 0.0016	47.1	2.1	130 ± 112	0.992
5435675	15.5	6032	0.97	1.05	−0.06	1.5229 ± 0.0006	29.5	1.6	45 ± 85	0.987
5479689	13.8	5360	0.91	0.92	0.24	1.7012 ± 0.0008	15.5	5	158 ± 111	0.987
5566612	14.9	5799	1.1	0.97	0	2.7099 ± 0.002	27.1	11.4	513 ± 454	0.972
5597644	14.9	4608	0.67	0.66	−0.22	1.6228 ± 0.0007	38.4	11.3	473 ± 170	0.985
5716330	13.9	5952	1.3	1.05	0	1.5689 ± 0.0006	29.8	4.6	57 ± 32	0.996
5781247	15.6	5561	0.85	0.97	0.07	1.6547 ± 0.0008	37.1	1.5	59 ± 112	0.974
5878307	13.6	6856	1.42	1.28	−0.26	2.0466 ± 0.0013	18.9	5	116 ± 112	0.978
6047853	13.7	6422	1.42	1.22	−0.06	3.4193 ± 0.0029	23.3	2.8	129 ± 83	0.985
6065597	13.7	5726	1.18	0.92	−0.08	3.9872 ± 0.0045	20.4	4	254 ± 232	0.987
6388466	14.5	5988	1.07	0.99	−0.12	1.269 ± 0.0004	43.2	4.4	64 ± 64	0.993
6603087	15.4	5522	0.87	0.97	0.14	1.3457 ± 0.0005	35.3	1.9	50 ± 82	0.975
6613542	14.1	6113	0.98	0.96	−0.34	4.4023 ± 0.005	20.8	1.6	240 ± 269	0.976
6675953	13.7	5565	0.82	0.8	−0.36	2.9651 ± 0.0023	20.4	1.4	135 ± 118	0.988
6783562	13.6	6930	1.19	1.22	−0.32	2.0883 ± 0.0014	42.7	6.3	243 ± 267	0.992
6976754	14.5	6607	1.2	1.22	−0.1	1.4196 ± 0.0005	56.6	3.5	64 ± 73	0.989
7045031	15.8	5445	0.8	0.78	−0.34	1.2886 ± 0.0007	52.3	1.9	44 ± 76	0.976
7512130	14.3	6108	1.03	1.13	0.07	1.8712 ± 0.0008	44.7	1.2	47 ± 64	0.996
7596734	15	6078	0.88	0.96	−0.38	2.5893 ± 0.0016	21.6	2.8	230 ± 264	0.989
7735171	15.2	5738	0.95	0.97	0.02	1.8334 ± 0.0008	37.2	5.1	186 ± 190	0.984
7938689	15.4	5269	0.92	0.83	0.07	1.841 ± 0.0009	48.6	4.9	148 ± 125	0.991
7978458	15	6188	1.08	1.03	−0.14	1.4349 ± 0.0005	29.2	4	77 ± 82	0.986
8026887	13.9	5988	1.22	1.01	−0.08	1.9232 ± 0.0009	35.1	11.3	234 ± 113	0.995
8042004	13.3	6223	1.64	1.3	0.14	1.9498 ± 0.0011	14.5	3.3	42 ± 39	0.991
8098079	15.1	6198	1	1.07	−0.16	1.2873 ± 0.0004	34.4	4.3	89 ± 106	0.992
8121913	11.7	5702	1.45	1.03	0.14	3.2943 ± 0.0003	18.5	2.1	61 ± 66	0.988
8122501	14.7	5543	0.82	0.73	−0.54	1.029 ± 0.0029	64	6.9	98 ± 79	0.98
8358253	14.5	5985	0.85	0.94	−0.36	1.5602 ± 0.0006	33.7	16	609 ± 541	0.996
8364428	13.4	6525	1.59	1.05	−0.56	1.5737 ± 0.0008	49.9	12.6	96 ± 105	0.985
8389838	15	5725	0.94	0.87	−0.24	2.0092 ± 0.0011	23.5	6	226 ± 204	0.972
8410490	14.1	5821	0.87	0.91	−0.28	1.3621 ± 0.0004	23.3	3.4	88 ± 83	0.994
8499974	14.9	5952	0.84	0.93	−0.36	1.4916 ± 0.0005	35.9	6.6	238 ± 222	0.981
8559208	14.5	6029	0.82	0.92	−0.5	2.1802 ± 0.0012	25.9	5.1	368 ± 338	0.992
8752590	15.6	5750	0.95	0.97	0	1.088 ± 0.0003	58.4	6.6	99 ± 100	0.98
8832613	15.8	4571	0.66	0.67	−0.2	1.5788 ± 0.0008	53.8	9.1	393 ± 187	0.971
8847921	14.7	5739	1.14	0.93	−0.06	1.7975 ± 0.0008	28	1.7	32 ± 46	0.985
8885337	13.9	6950	1.39	1.43	0.07	2.2199 ± 0.0013	37.8	4.2	139 ± 171	0.978
9040864	14.7	5774	1.13	0.92	−0.12	2.1063 ± 0.0013	46.2	3.2	81 ± 90	0.971
9287646	14.1	6258	1.03	1.09	−0.16	2.5591 ± 0.0018	18.9	9	568 ± 569	0.986
9475045	15.1	6143	1.12	1.19	0.21	1.173 ± 0.0003	38.4	3.6	51 ± 63	0.995
9570402	14	6103	0.95	1.03	−0.2	1.683 ± 0.0007	48.1	4.3	156 ± 151	0.994
9724972	14.2	6017	0.99	0.91	−0.38	2.0872 ± 0.0013	32.6	6.7	256 ± 227	0.978
10068024	13.1	6349	1.06	1.08	−0.22	2.0735 ± 0.0012	20.6	5.6	229 ± 243	0.985
10091175	14.5	6063	1.01	1.12	0.07	1.8757 ± 0.0009	39.4	9.9	394 ± 385	0.991
10619862	14.5	5783	0.88	0.98	−0.08	3.0538 ± 0.0025	33.7	1.8	190 ± 247	0.988
10931452	12.7	6666	1.54	1.14	−0.38	2.7001 ± 0.0023	21.7	4.5	108 ± 91	0.971
11152428	13.8	6774	1.35	1.25	−0.26	1.1726 ± 0.0003	33.6	8.8	89 ± 85	0.982
11235536	15.5	5631	0.81	0.88	−0.24	1.4585 ± 0.0005	54.7	8.8	307 ± 262	0.992
11362225	12	6623	1.9	1.45	0.04	2.7513 ± 0.0021	42.4	5.9	103 ± 71	0.995
12207153	14.8	5569	0.89	1.01	0.21	1.4225 ± 0.0005	30.2	8.6	257 ± 217	0.979
12357100	15.1	5880	0.92	1.01	−0.06	2.6613 ± 0.0021	36.7	3.2	258 ± 302	0.979

Download table as: ASCII Typeset image

There are three additional parameters in Table 1 that have not yet been introduced: ${{h}_{A}}^{2}$ , ${{h}_{\theta }}^{2}$ , and τ. These are related to modulations we impose on the planet's albedo and phase offset. This variability is motivated by the assumption that planets experience surface feature evolution and atmospheric flow that causes their spatially dependent reflectivity to change slightly over time, which in turn influences the amplitudes and shapes of their observed phase curves. While the magnitude and timescale of this effect is unknown, a fractional albedo variation of up to $\sim 30 \%$ seems plausible (Rauscher et al. 2008; Armstrong et al. 2016), and phase offset variations on the order of ∼0.15 have been observed (Armstrong et al. 2016). Moreover, the coherence timescale in the observations of weather on the transiting hot Jupiter HAT-P-7b is tens to hundreds of days (Armstrong et al. 2016).

To model the albedo and phase offset variability, we use random draws from a Gaussian Process (GP). GPs are used as a non-parametric method of modeling a function in some continuous input space (Gelman et al. 2014). Every realization of a GP is a random variable drawn from a multivariate normal distribution with a mean vector and covariance matrix, where the matrix is constructed from a covariance function that dictates the shrinkage toward the mean and the correlation between pairs of data points. A widely used covariance function is the squared exponential, given by

$\begin{eqnarray}&&k({t}_{i},{t}_{j})={h}^{2}\exp \left(-\displaystyle \frac{{({t}_{i}-{t}_{j})}^{2}}{2{\tau }^{2}}\right).\end{eqnarray} \tag{ 11 }$

In the case of modeling an albedo or phase offset variation, t is time, h² is the maximum covariance of the variation, and τ is a timescale dictating the smoothness of the variability. ${\boldsymbol{K}}$ is the covariance matrix built from the covariance function, given by

$\begin{eqnarray}{\boldsymbol{K}}=\left[\begin{array}{cccc}k({t}_{1},{t}_{1}) & k({t}_{1},{t}_{2}) & ... & k({t}_{1},{t}_{n})\\ k({t}_{2},{t}_{1}) & k({t}_{2},{t}_{2}) & ... & k({t}_{2},{t}_{n})\\ \vdots & \vdots & \ddots & \vdots \\ k({t}_{n},{t}_{1}) & k({t}_{n},{t}_{2}) & ... & k({t}_{n},{t}_{n})\end{array}\right].\end{eqnarray} \tag{ 12 }$

Let ${{\boldsymbol{x}}}_{A}=({x}_{A1},{x}_{A2},...{x}_{{An}})$ be a vector representing the fractional deviation of the albedo from its mean at n discrete timesteps. In other words,

$\begin{eqnarray}&&{x}_{{Ai}}={x}_{A}({t}_{i})=\displaystyle \frac{{A}_{g}({t}_{i})-\langle {A}_{g}\rangle }{\langle {A}_{g}\rangle }.\end{eqnarray} \tag{ 13 }$

Let ${{\boldsymbol{x}}}_{\theta }$ be the analogous vector for the phase offset. We model ${{\boldsymbol{x}}}_{A}$ and ${{\boldsymbol{x}}}_{\theta }$ as independent multivariate Gaussian random variables: ${{\boldsymbol{x}}}_{A}\sim { \mathcal N }(0,{{\boldsymbol{K}}}_{A})$ and ${{\boldsymbol{x}}}_{\theta }\sim { \mathcal N }(0,{{\boldsymbol{K}}}_{\theta })$ . The mean vector is zero because $E[{x}_{{Ai}}]=E[{x}_{\theta i}]=0$ .

Figure 5 illustrates several realizations of the GP for ${h}^{2}\,=0.0075,\tau =10$ days (top panel) and ${h}^{2}=0.015,\tau =20$ days (bottom panel). In this figure, ${\boldsymbol{x}}$ could represent either ${{\boldsymbol{x}}}_{A}$ or ${{\boldsymbol{x}}}_{\theta }$ .

We construct a synthetic phase curve for each light curve as follows. We draw a set of parameters from the distributions in Table 1. The covariance matrices, ${{\boldsymbol{K}}}_{A}$ and ${{\boldsymbol{K}}}_{\theta }$ , are constructed using the same timescale, τ, but they use their respective covariance amplitudes, ${{h}_{A}}^{2}$ and ${{h}_{\theta }}^{2}$ . We draw x_A(t) and ${x}_{\theta }(t)$ as GP realizations. For convenience, we make use of the george³ code for GP regression (Ambikasaran et al. 2014). Given the random draws for $\langle {A}_{g}\rangle$ and $\langle \theta \rangle$ , the albedo variation function is calculated as ${A}_{g}(t)=\langle {A}_{g}\rangle [1+{x}_{A}(t)]$ and the phase offset $\theta (t)=\langle \theta \rangle [1+{x}_{\theta }(t)]$ . These time-dependent quantities and the rest of the parameters are then inserted into the phase curve model outlined in Section 3.1. Finally, the model phase curve is multiplied by the Kepler PDC flux to create the synthetic light curve.

3.3. Recovery Efficiency

We now wish to evaluate the efficiency with which these synthetic phase curves are recovered by the pipeline outlined in Section 2. We ran the pipeline on each of the 10,000 synthetic injections. We defined the synthetic phase curve to be "recovered" when the period of the detected signal was within 0.01 days of the injected signal. In the upper panel of Figure 6, we plot the histograms of the phase curve semi-amplitudes for all injections and for recovered injections only. The shape of the histogram for the injected data has little physical significance other than to reflect the typical amplitude of the injected phase curves. We show the fraction of recovered injections in the lower panel of the figure. The recovery efficiency is a smooth and increasing function of amplitude, with 20 ppm amplitude phase curves being recovered $\sim 50 \%$ of the time.

**Figure 6.** Recovery efficiency of the pipeline. In the top panel we plot histograms of the injected phase curve semi-amplitudes for all injections (in green) and for the injections that were recovered (in blue). In the lower panel we show the ratio of the histograms, yielding the fraction recovered as a function of amplitude.
Download figure:
Standard image High-resolution image

For the injections that were recovered, we plot in Figure 7 the semi-amplitude of the detected signal versus that of the injected signal. The relationship makes clear that the pipeline attains a reasonably accurate recovery of the phase curve amplitudes. The bias toward underestimation at large amplitude arises because the pipeline uses sinusoidal fits that average over ellipsoidal features in the light curve.

4. Binary Classification Using Logistic Regression

Now that we have demonstrated that the pipeline is capable of retrieving the signals of synthetic injected phase curves, we proceed to classify light curves by employing a logistic regression algorithm in a supervised machine learning context. We aim to have the algorithm "learn" the properties of planetary phase curves and use this information to determine whether phase curves are present in newly introduced light curves. We offer an introduction to logistic regression in Section 4.1 and apply it to phase curve classification in Section 4.2.

4.1. Logistic Regression Overview

Logistic regression is a technique used to predict a categorical dependent variable based on observed characteristics called explanatory variables or predictors (Gelman et al. 2014). The dependent variable is typically a binary response (e.g., pass/fail, healthy/sick). Like other forms of regression, there may be any number of predictors, and they can be either continuous or discrete.

Consider a data set of length n containing binary response dependent variables, Y_i, which may take on the values 0 or 1. Assume each response variable is associated with m explanatory variables: ${x}_{1i},{x}_{2i},...,{x}_{{mi}}$ . The subscript i indicates an index corresponding to each outcome in the data set. Let the (unknown) probability of success of the outcome be p_i, where p_i is unique to each outcome, but is related to the explanatory variables:

$\begin{eqnarray}&&E[{Y}_{i}\ | \ {x}_{1i},{x}_{2i},...,{x}_{{mi}}]={p}_{i}.\end{eqnarray} \tag{ 14 }$

The outcomes are thus Bernoulli random variables,

$\begin{eqnarray}&&{Y}_{i}\ | \ {x}_{1i},{x}_{2i},...,{x}_{{mi}}\sim \mathrm{Bernoulli}({p}_{i}),\end{eqnarray} \tag{ 15 }$

with a probability mass function,

$\begin{eqnarray}&&P({Y}_{i}={y}_{i}\ | \ {x}_{1i},{x}_{2i},...,{x}_{{mi}})={{p}_{i}}^{{y}_{i}}{(1-{p}_{i})}^{1-{y}_{i}}.\end{eqnarray} \tag{ 16 }$

In a logistic regression model, we assume that the probability of success varies systematically as a function of the explanatory variables:

$\begin{eqnarray}&&\mathrm{logit}({p}_{i})=\mathrm{ln}\left(\tfrac{{p}_{i}}{1-{p}_{i}}\right)={\boldsymbol{\beta }}\cdot {{\boldsymbol{x}}}_{i}.\end{eqnarray} \tag{ 17 }$

Here, ${\boldsymbol{\beta }}$ is the vector of regression coefficients,

$\begin{eqnarray}&&{\boldsymbol{\beta }}=({\beta }_{0},{\beta }_{1},{\beta }_{2},...,{\beta }_{m}),\end{eqnarray} \tag{ 18 }$

and ${{\boldsymbol{x}}}_{i}$ is the vector of explanatory variables,

$\begin{eqnarray}&&{{\boldsymbol{x}}}_{i}=(1,{x}_{1i},{x}_{2i},...,{x}_{{mi}}).\end{eqnarray} \tag{ 19 }$

A value of 1 has been appended as the first element corresponding to the intercept coefficient, ${\beta }_{0}$ . Note that when a given regression coefficient, ${\beta }_{j}$ , is zero, the outcome is independent of the corresponding explanatory variable, x_ji.

Rewriting Equation (17) in terms of p_i yields

$\begin{eqnarray}&&{p}_{i}={\mathrm{logit}}^{-1}\,({\boldsymbol{\beta }}\cdot {{\boldsymbol{x}}}_{i})=\displaystyle \frac{1}{1+{e}^{-{\boldsymbol{\beta }}\cdot {{\boldsymbol{x}}}_{i}}}.\end{eqnarray} \tag{ 20 }$

Using Equations (16) and (20), one can show that the log-likelihood $l({\boldsymbol{\beta }})$ can be expressed as

$\begin{eqnarray}l({\boldsymbol{\beta }}) & = & \mathrm{ln}\left[\displaystyle \prod _{i=1}^{n}p({y}_{i}\ | \ {{\boldsymbol{x}}}_{i})\right]\\ & = & \displaystyle \sum _{i=1}^{n}[(1-{y}_{i})\mathrm{ln}(1+{e}^{{\boldsymbol{\beta }}\cdot {{\boldsymbol{x}}}_{i}})-{y}_{i}\mathrm{ln}(1+{e}^{-{\boldsymbol{\beta }}\cdot {{\boldsymbol{x}}}_{i}})].\end{eqnarray} \tag{ 21 }$

The regression is solved via maximum likelihood estimation (MLE) by identifying the vector of regression coefficients, $\hat{{\boldsymbol{\beta }}}$ , that maximizes $l({\boldsymbol{\beta }})$ . In a Bayesian context, one can impose a prior distribution on the regression coefficients, $\pi ({\boldsymbol{\beta }})$ . When the vector of coefficient estimators is found, the unknown probabilities, p_i, may be calculated with Equation (20). More importantly, one may estimate the probabilities of unknown outcomes if provided the observations of the relevant explanatory variables.

In the analysis that follows, we perform logistic regression using the Scikit-learn library⁴ (Pedregosa et al. 2011). We use a regularization strength, $\alpha =1$ , and we use Newton-CG optimization to determine the regression coefficients.

4.2. Phase Curve Classification Using Logistic Regression

We now apply the logistic regression model to the problem of classifying phase curve signals in Kepler light curves. The response variable, Y_i, represents whether the detected signal is a phase curve (Y_i = 1) or not (Y_i = 0). The probability, p_i, is the probability that a given detection is a true planetary phase curve. (Note this is different from the probability that a phase curve is present in the light curve.)

We used a collection of m = 12 predictors, many of them introduced in Section 2:

1.
Signal period, P
2.
Signal amplitude
3.
Chi-square ratio of the two-component to one-component phase curve fits, ${{\chi }^{2}}_{2}/{{\chi }^{2}}_{1}$
4.
Durbin–Watson statistic ratio, ${d}_{2}/{d}_{1}$
5.
Q_h versus h slope ratio, ${m}_{2}/{m}_{1}$
6.
Q_h versus h y-intercept ratio, ${b}_{2}/{b}_{1}$
7.
Full-width at half maximum (FWHM) of the peak at P in the Lomb–Scargle periodogram
8.
Significance of the peak in the Lomb–Scargle periodogram
9.
Significance of the peak in the "local significance" periodogram
10.
Normalized significance of the $P/2$ harmonic peak in the Lomb–Scargle periodogram
11.
Normalized significance of additional peaks in the periodogram
12.
Deviation of the phase-folded light curve from sinusoidality

Predictors 1–9 were discussed in Section 2. Here we introduce predictors 10–12, which were all motivated by inspection of the signals in the synthetic data sets.

1.
Predictor 10: true phase curves should exhibit some power in the periodogram at the $P/2$ harmonic due to the ellipsoidal variation component of the phase curve. We quantify this with predictor 10 by finding the significance of the $P/2$ harmonic peak in the Lomb–Scargle periodogram and taking its ratio with respect to the peak at P.
2.
Predictor 11: for true phase curve detections, no significant peaks should remain in the periodogram after the signals at P and $P/2$ are removed. For non-phase curve signals such as astroseismic pulsations (e.g., Grigahcène et al. 2010), significant peaks are accompanied by additional peaks. To calculate predictor 11, we start with the Lomb–Scargle periodogram and remove the peaks at the signal period, P, and at the harmonic, $P/2$ . We then calculate the local significance periodogram, find the highest peak, and normalize it with respect to the height of the peak at P.
3.
Predictor 12: large-amplitude phase curves should tend to exhibit a deviation from sinusoidality due to the presence of the beaming and ellipsoidal components. We quantify this tendency with predictor 12. We first use the fit in Section 2.2 to detrend the light curve. We phase-fold it at the candidate period, bin it, fit a sinusoid to the phase-folded signal, and calculate the residuals of the fit. The predictor is then the Durbin–Watson statistic of the residuals (see Section 2.3). This quantifies the autocorrelation in the residuals, or the deviation from sinusoidality.

All 12 predictors were shown to exhibit significant differences between phase curve detections and non-phase curve detections in the synthetic data sets, either independently or in combination with other predictors. We show an example of this using a triangle diagram in Figure 8. This figure shows the correlations between four of the predictors: predictor 3, ${{\chi }^{2}}_{2}/{{\chi }^{2}}_{1};$ predictor 4, ${d}_{2}/{d}_{1};$ predictor 5, ${m}_{2}/{m}_{1};$ and predictor 11, x₁₁. Plotted in blue are the recovered phase curve injections in the synthetic data. In green we show the detections that did not correspond to the injected phase curve.

**Figure 8.** Triangle plot showing the correlations between four of the 12 predictors: predictor 3, ${{\chi }^{2}}_{2}/{{\chi }^{2}}_{1};$ predictor 4, ${d}_{2}/{d}_{1};$ predictor 5, ${m}_{2}/{m}_{1};$ and predictor 11, x₁₁. The points represent the synthetic light curves generated in Section 3.2. Blue points are recovered phase curve injections. Green points are "non-phase curves," that is, detections that did not match the injected phase curve. Broadly speaking, the predictors show very significant differences between the two populations.
Download figure:
Standard image High-resolution image

Clearly, there are significant differences in the two populations. Consider, for example, the top left plot of ${{\chi }^{2}}_{2}/{{\chi }^{2}}_{1}$ versus ${d}_{2}/{d}_{1}$ . Both phase curves and non-phase curves exhibit ${{\chi }^{2}}_{2}/{{\chi }^{2}}_{1}\lt 1$ because the two-component fit from Section 2.3 has more degrees of freedom than the one-component fit. However, ${d}_{2}/{d}_{1}$ for phase curves tends to be larger and follows a straight line because the two-component fit removes the residual autocorrelations best when the signal is a true phase curve, or a near-sinusoid. (We recall that as the amount of positive autocorrelation decreases, d increases and m decreases. For strong phase curve detections, we should therefore expect ${d}_{2}/{d}_{1}\gg 1$ and ${m}_{2}/{m}_{1}\ll 1$ .) Generally speaking, the separation between the phase curve and non-phase curve populations continues at smaller scales (where the points are overlapping on the figure), but it is less pronounced.

Given the distinct differences in the predictors between signals that are or are not phase curves, the multivariate logistic regression should be capable of picking up on these differences and classifying light curves accordingly. We describe this process next.

4.3. Testing on Synthetic Phase Curves

We applied the logistic regression to the set of 10,000 synthetic phase curves that we produced in Section 3.2. We first randomly split the synthetic data set into a training set and a test set, with the test set being 5% of the total. We then regressed on the training set using the 12 predictors introduced in the previous section and the outcomes ${Y}_{i}$ that indicate whether each detected signal of the training group was the injected phase curve or not. Finally, we ran predictions on the test set, supplying the 12 predictors and calculating the predicted probabilities that each member of the test set was or was not a phase curve. The predictions were then compared to the actual results, that is, the outcomes of whether each signal detected by the pipeline was the injected phase curve or not.

The results of the test are shown in Figure 9. On the x-axis we plot the semi-amplitude of the signal derived by the two-component fit described in Section 2.2. (Note that this amplitude is half the peak-to-peak amplitude.) The y-axis is the probability predicted by the logistic regression that the detection is a true phase curve. Black dots are correct predictions, and red dots are incorrect predictions. More explicitly, black dots with $P(\mathrm{phase}\ \mathrm{curve})\gt 0.5$ are cases where the logistic regression predicted that the signal detected by the pipeline was a phase curve, and the signal did indeed correspond to the injected phase curve. Red dots with $P(\mathrm{phase}\ \mathrm{curve})\gt 0.5$ are cases where the logistic regression predicted that the signal was a phase curve, but the signal did not correspond to the injected phase curve. The situation is reversed for points below the $P(\mathrm{phase}\ \mathrm{curve})=0.5$ line.

Generally speaking, the logistic regression performs remarkably well. In total, correct predictions are made $\sim 90 \%$ of the time, and the accuracy increases steeply as $P(\mathrm{phase}\ \mathrm{curve})$ approaches 0 or 1. Larger amplitude detections are assigned more definitive probabilities (i.e., $P(\mathrm{phase}\ \mathrm{curve})$ closer to 0 or 1).

We tested the consistency of the algorithm's predictive performance using 5000 realizations of the training and test procedure described in the first paragraph of this section. Each realization used a random partition of the synthetic data set into the 95% training and 5% test sets. For each trial, we performed the logistic regression and identified the correct and incorrect predictions. Histograms of the fraction of correct predictions are shown in Figure 10. The green histogram shows that the algorithm predicts correctly $\sim 91 \%$ of the time. For cases with reported probabilities greater than 0.9 (purple histogram), the prediction is correct $\sim 98 \%$ of the time. As indicated by the widths of the histograms, these results are fairly consistent from trial to trial.

**Figure 10.** Histograms of the fraction of correct predictions for 5000 random realizations of the synthetic data training and testing procedure. The green histogram shows each realization's fraction of predictions that are correct. The purple histogram is restricted to predictions with probabilities greater than 0.9.
Download figure:
Standard image High-resolution image

We also computed the fraction of correct predictions as a function of the probability reported by the logistic regression. The mean curve over the 5000 random realizations is displayed in Figure 11. The results indicate that the reported probabilities can indeed be interpreted as the likelihood that a given signal corresponds to a true phase curve.

**Figure 11.** Fraction of correct predictions vs. the probability reported by the logistic regression. The dark green curve is the mean over the 5000 realizations, and the surrounding lighter green band shows the standard deviation. The standard deviation is largest toward the center because most points in a given realization are predicted with probabilities near 0 or 1 (see Figure 9). The dashed black line is the probability theory expectation.
Download figure:
Standard image High-resolution image

4.4. Testing on Transiting Kepler Hot Jupiters

We also tested the detection pipeline and logistic regression on the collection of transiting hot Jupiters in the Kepler field. We first compiled the list of KOIs in the Q1-Q17 DR 24 catalog (Coughlin et al. 2016) from the NASA Exoplanet Archive⁵ (Akeson et al. 2013). We filtered the confirmed planets and planet candidates to those with periods in a 1–5 day range and radii in the range 0.5–3 ${R}_{\mathrm{Jup}}$ . For the unconfirmed planet candidates, we used the KOI false positive probabilities (FPPs) calculated by Morton et al. (2016) to restrict the list to those with an FPP lower than 10%. These steps resulted in a collection of 59 total hot Jupiter planets and planet candidates with potentially detectable phase curves.

The pipeline described in Section 2 was executed in the same manner as on the synthetic data sets. The only difference was that the transits and (potential) secondary eclipses were removed from the light curves before processing them. The pipeline recovered planetary phase curves in four cases: HAT-P-7b (also called Kepler-2b), Kepler-13b, Kepler-41b, and Kepler-76b. In these cases, the period of the signal recovered by the pipeline matched the planet period to within 0.01 days. This apparently low recovery rate is not surprising. Esteves et al. (2015) conducted a comprehensive search for planetary phase variations in Kepler transiting planets, finding 14 planets in total. Twelve of these have parameters fitting our criteria ( $1\ \mathrm{day}\lt P\lt 5\ \mathrm{days}$ and $0.5\ {R}_{\mathrm{Jup}}\lt {R}_{{\rm{P}}}\lt 3\ {R}_{\mathrm{Jup}}$ ). In this sense, there are 12 potentially detectable phase curves, and one-third of these have amplitudes large enough to be detected by our pipeline.

Recovery by the pipeline is distinct from recovery by the logistic regression. The next step is to consider predictions of the transiting data set to confirm that the logistic regression places high probabilities on the four recovered phase curve signals and low probabilities on the rest.

Using the regression that was performed on the synthetic data in the previous section, we made predictions on the light curves of the transiting planets. The results are shown in Figure 12. In all four cases where the pipeline recovered the planet's phase curve, the logistic regression correctly predicted that the detection was a true phase curve. This is represented by the four annotated black points with $P(\mathrm{phase}\ \mathrm{curve})\gt 0.5$ . In all but one of the remaining cases, the logistic regression correctly predicted that the signal detected by the pipeline was not a phase curve. In other words, all of the black points with $P(\mathrm{phase}\ \mathrm{curve})\lt 0.5$ were cases where the pipeline did not recover the planet's period, and the logistic regression correctly reported that the signals were not real phase curves.

**Figure 12.** Results of the logistic regression on a set of transiting planets in the *Kepler* field with $1\ \mathrm{day}\lt P\lt 5\ \mathrm{days}$ and $0.5\ {R}_{\mathrm{Jup}}\lt {R}_{{\rm{P}}}\lt 3\ {R}_{\mathrm{Jup}}$ . On the x-axis we plot the semi-amplitude of the signal detected by the pipeline in Section 2.2. On the y-axis we show the probability predicted by the logistic regression that the detection is a true phase curve. The horizontal dashed line at 0.5 is the boundary between where a given signal is predicted to be a phase curve or not. The four annotated black points with P(phase curve) $\gt 0.5$ correspond to the cases where the pipeline recovered the planetary phase curve, and the logistic regression correctly identified these signals to be true phase curves. The black points with P(phase curve) $\lt 0.5$ are cases where the signal detected by the pipeline did not correspond to the planetary period, and the logistic regression correctly predicted that the signals were not true phase curves.
Download figure:
Standard image High-resolution image

The one false positive pictured in the upper right corner of Figure 12 has an amplitude greater than that expected from a planetary phase curve. The signal is from the light curve of KIC 5651104/KOI-840/Kepler-695b. UKIRT images available from the Kepler Exoplanet Follow-up Observing Program (ExoFOP)⁶ show that the star has nearby background or foreground companions, so the signal probably results from blended non-eclipsing binary stars. We found further evidence for this interpretation in that the phase curve amplitude is inconsistent when the light curve is folded at twice the detected period. Details on the nature of this inconsistency and steps taken to rule out these types of false positives are discussed in Section 5.1.

In short, the logistic regression performs exceedingly well on both synthetically generated data sets and the light curves of transiting Kepler planets. In the vast majority of cases, the algorithm correctly predicts when a signal is or is not a true planetary phase curve. This gives us confidence in the process of applying the logistic regression to novel light curves and inspecting them for phase curve detections.

5. Application to Kepler FGK Stars

We now describe the application of our pipeline and logistic regression algorithm to the search for non-transiting planets around Kepler FGK stars without confirmed planets or KOIs. We first downloaded the Q1-Q17 DR 25 Kepler stellar catalog (Mathur et al. 2017) from the NASA Exoplanet Archive⁵ (Akeson et al. 2013). We filtered the target stars for FGK main-sequence stars, using the criteria from Christiansen et al. (2016): $4000\ {\rm{K}}\lt {T}_{\mathrm{eff}}\lt 7000\ {\rm{K}}$ and $\mathrm{log}\,g\gt 4.0$ . We then removed stars with known planets and planetary candidates, leaving 146,980 targets. For these targets, we retrieved all 17 quarters of Kepler long-cadence PDC photometry. These light curves are publicly available at the Mikulski Archive for Space Telescopes (MAST).

We then ran the pipeline outlined in Section 2 on the set of light curves, with successful convergence on 142,630 of them. This procedure identified the most likely phase curve signal in each light curve, if a phase curve were to be present. Using the logistic regression that was performed on the synthetic data in Section 4.3, we then made predictions on this novel set of light curves, evaluating the likelihood that each detected signal could correspond to a true phase curve.

The results are displayed in Figure 13. This plot is analogous to Figures 9 and 12, with the x-axis being the semi-amplitude of the signal derived by the two-component fit described in Section 2.2 and the y-axis being the probability output by the logistic regression described in Section 4.2. The points have been colored by their density, and the panel at the right is a histogram of the probabilities on a log scale. As expected and required, the vast majority of targets have $P(\mathrm{phase}\ \mathrm{curve})\lt 0.5$ .

5.1. Binary Star Filtering

Of the potential sources for astrophysical false positives, short-period binary stars are one of the most significant concerns. Binary stars have an occurrence rate of $\gtrsim 10 \%$ at periods of 1–5 days (Kirk et al. 2016), about two orders of magnitude greater than the occurrence rate of hot Jupiters with phase variations detectable by our pipeline. Their phase curves are dominated by ellipsoidal variations (Faigler et al. 2012), meaning that our pipeline would detect them at half their orbital period. Although the amplitudes of their ellipsoidal variations are typically 100–1000 ppm, the amplitude is given by ${A}_{\mathrm{ellip}}\propto {\sin }^{2}i$ (see Equation (9)), so inclined systems could appear with amplitudes smaller than 100 ppm. Moreover, potential background or foreground stars could contaminate the photometric aperture and make the ellipsoidal variations appear with small amplitudes.

Fortunately, there are means of distinguishing between a binary system and a giant planet phase curve, even at small amplitudes. Because ellipsoidal variations dominate the binary stars' light curve, they would be detected at half their orbital period. However, the Doppler beaming amplitude is non-negligible (Faigler et al. 2012), meaning that the periodogram of a binary light curve should show significant power at the true period, or twice the detected period. Planetary phase curves should not show any signal at $2P$ . Furthermore, when the light curve is folded at twice the detected period and split at the midpoint, the beaming amplitude of a binary system will cause the two halves to be distinct, but a planetary signal will have two identical curves. To address these features and eliminate binary stars, we constructed the following three filters. The values of the thresholds were determined by inspection of the synthetic data sets.

1.
Require ${\mathrm{power}}_{2P}/{\mathrm{power}}_{P}\lt 0.4$ , where P is the detected period (or, in the case of binaries, half the true period).
2.
If the light curve is folded and binned at $2P$ , and if y_i represents the data in the first half and z_i represents the second half, then require $0.8\lt r\lt 1.2$ , where $r=\tfrac{\tfrac{1}{N}{\sum }_{i=1}^{N}{({y}_{i}-{z}_{i})}^{2}}{{{\sigma }_{y}}^{2}+{{\sigma }_{z}}^{2}}$ , ${{\sigma }_{y}}^{2}={\sum }_{i=1}^{N}{{\sigma }_{{yi}}}^{2}$ and ${{\sigma }_{z}}^{2}={\sum }_{i=1}^{N}{{\sigma }_{{zi}}}^{2}$ . We take N = 200, such that 400 points span the $2P$ folded light curve.
3.
Reqiure $\tfrac{\mathrm{Amp}}{300\ \mathrm{ppm}}\lt {\left(\tfrac{P}{1\mathrm{day}}\right)}^{-4/3}$ , where Amp is the semi-amplitude.

To apply these filters to the sample, we eliminated all points with $P\gt 0.5$ that did not adhere to these limits. The results of these filters are shown in Figure 16, where they are discussed in conjunction with the additional vetting in the next section.

5.2. Further Vetting via Physical Plausibility Considerations

In addition to binary stars, there are likely additional types of false positives that are not being taken into account (Shporer 2017). It is therefore worthwhile to take further steps to address these.

Another powerful method for false positive vetting is to consider whether the signal morphologies are physically plausible given the expectations for planetary phase curves. Similar to the binary star filtering in the previous section, the additional vetting that we now describe was only applied to cases with predicted probabilities $P\gt 0.5$ (hereafter called "candidates").

We started with a fitting procedure for assessing the physical plausibility of the candidate phase curves. For each candidate, we calculated a one-component sinusoidal fit, representing a fit for reflected light only. We then calculated a three-component fit modeled by

$\begin{eqnarray}f(\phi ) & = & 1-{A}_{\mathrm{refl}}\cos (2\pi (\phi +{\theta }_{\mathrm{refl}}))\\ & & -{A}_{\mathrm{ellip}}\cos (4\pi (\phi +{\theta }_{\mathrm{ellip}}))\\ & & +{A}_{\mathrm{beam}}\sin (2\pi (\phi +{\theta }_{\mathrm{ellip}})).\end{eqnarray} \tag{ 22 }$

This is a simple way to assess the phase curve without applying the full phase curve model described in Section 3.1. The ${A}_{\mathrm{refl}}$ term represents the reflected light component with one phase, ${\theta }_{\mathrm{refl}}$ , and the ${A}_{\mathrm{ellip}}$ and ${A}_{\mathrm{beam}}$ terms are the ellipsoidal and beaming components with a separate phase, ${\theta }_{\mathrm{ellip}}$ . The thermal component is incorporated into the ${A}_{\mathrm{refl}}$ term, since the two cannot be separated without more advanced modeling. An example of the one-component and three-component phase curve fits for one of the candidates, KIC 5716330, is shown in Figure 14.

**Figure 14.** Example of the one-component fit (in green) and three-component fit (in purple) for KIC 5716330, one of the candidates. The reflected light, ellipsoidal, and beaming curves composing the three-component fit are displayed with dashed lines, and the phase offset, ${\rm{\Delta }}\theta ={\theta }_{\mathrm{refl}}-{\theta }_{\mathrm{ellip}}$ , is labeled. The fit indicates an eastward shift in the phase curve maximum, with the peak before the substellar point.
Download figure:
Standard image High-resolution image

**Figure 14.** Example of the one-component fit (in green) and three-component fit (in purple) for KIC 5716330, one of the candidates. The reflected light, ellipsoidal, and beaming curves composing the three-component fit are displayed with dashed lines, and the phase offset, ${\rm{\Delta }}\theta ={\theta }_{\mathrm{refl}}-{\theta }_{\mathrm{ellip}}$ , is labeled. The fit indicates an eastward shift in the phase curve maximum, with the peak before the substellar point.
Download figure:
Standard image High-resolution image

We then used these light curve fits to examine the candidates for false positives. We first calculated the Durbin–Watson statistic, d, of the residuals of the three-component fits (see Equation (1)). If the statistic is far from 2, this indicates autocorrelated residuals and implies that the light curve is not well described by the fit. We eliminated all candidates with $d\lt 1.85$ or $d\gt 2.15$ , where this range was determined by inspection of the synthetic phase curves.

Next, we examined the candidate phase curves' physical plausibility by considering ${A}_{\mathrm{refl}}$ and ${A}_{\mathrm{ellip}}$ . If, for example, ${A}_{\mathrm{refl}}$ is too large, the planet would have an abnormally large radius or albedo. If ${A}_{\mathrm{ellip}}/{A}_{\mathrm{refl}}$ is too large, the density would be abnormally high.

To convert ${A}_{\mathrm{refl}}$ and ${A}_{\mathrm{ellip}}$ into physical quantities, we first solve for the planet's radius in terms of ${A}_{\mathrm{refl}}$ and other observables:

$\begin{eqnarray}&&{R}_{p}={\left(\displaystyle \frac{2{A}_{\mathrm{refl}}}{{A}_{g}}\right)}^{1/2}{\left(\displaystyle \frac{{{GM}}_{\star }}{4{\pi }^{2}}\right)}^{1/3}{P}^{2/3}{(\sin i)}^{-1/2}.\end{eqnarray} \tag{ 23 }$

The factor of 2 in front of ${A}_{\mathrm{refl}}$ comes from the fact that ${A}_{\mathrm{refl}}$ is a semi-amplitude but Equation (7) considers a peak-to-peak amplitude.

Next, we solve for the planet's mean density in terms of ${A}_{\mathrm{ellip}}/{A}_{\mathrm{refl}}^{3/2}$ :

$\begin{eqnarray}&&{\rho }_{p}=\displaystyle \frac{{\rho }_{\star }}{2\sqrt{2}}\left(\displaystyle \frac{{A}_{\mathrm{ellip}}}{{A}_{\mathrm{refl}}^{3/2}}\right)\left(\displaystyle \frac{{{A}_{g}}^{3/2}}{{\alpha }_{2}}\right){(\sin i)}^{-1/2}.\end{eqnarray} \tag{ 24 }$

Here ${\alpha }_{2}$ is a constant in the ellipsoidal variation model (see Equation (9)). It varies nearly linearly with ${T}_{\mathrm{eff}}$ (see Figure 4).

For each candidate, we calculated estimates of R_p and ${\rho }_{p}$ by using the Kepler stellar catalog properties and by assuming ${A}_{g}\in [0.1-0.3]$ and $i\in [15^\circ ,85^\circ ]$ . This domain thus corresponds to an allowable region in ${\rho }_{p}-{R}_{p}$ space. The candidate's plausibility may be considered by comparing this region to the radii and density of known hot Jupiters. In Figure 15 we plot $\mathrm{ln}{\rho }_{p}$ versus R_p for all hot Jupiters with measured masses and radii, as obtained by the Exoplanets Data Explorer⁷ (Han et al. 2014). (We considered planets with $P\lt 5\ \mathrm{days}$ and $0.2\ {M}_{\mathrm{Jup}}\lt {M}_{p}\sin i\lt 5\ {M}_{\mathrm{Jup}}$ .) We also plot the ${\rho }_{p}-{R}_{p}$ regions for two candidates, one with strong physical plausibility and one with weak plausibility.

**Figure 15.** ${\rho }_{p}-{R}_{p}$ diagram that is used for additional candidate vetting. The black points with error bars correspond to known hot Jupiters with measured masses and radii; the kernel density estimation corresponding to these data points is shown with the gray gradient. The ${\rho }_{p}-{R}_{p}$ regions of two candidates are displayed, where ${A}_{g}\in [0.1-0.3]$ and $i\in [15^\circ ,85^\circ ]$ . The coloration is according to the albedo. The candidate labeled "accepted" is physically plausible since its region agrees with known hot Jupiter radii and densities. (It has a high value of S.) The candidate labeled "rejected" is less plausible.
Download figure:
Standard image High-resolution image

A quantitative metric is necessary to compare a given candidate's allowed ${\rho }_{p}-{R}_{p}$ domain to the region occupied by known hot Jupiters. We first calculated a kernel density estimation of the values for known hot Jupiters in $\mathrm{ln}{\rho }_{p}$ –R_p space, denoted $f({R}_{p},\mathrm{ln}{\rho }_{p})$ and illustrated with the gray gradient in Figure 15. We then calculated the integral, S, of the kernel density over each candidate's allowed domain,

$\begin{eqnarray}&&S={\iint }_{D}f({R}_{p},\mathrm{ln}{\rho }_{p}){{dR}}_{p}d\mathrm{ln}{\rho }_{p},\end{eqnarray} \tag{ 25 }$

where

$\begin{eqnarray}&&D=\{({R}_{p},\mathrm{ln}{\rho }_{p}):{A}_{g}\in [0.1,0.3],i\in [15^\circ ,85^\circ ]\}.\end{eqnarray} \tag{ 26 }$

We calculated the S integrals for all candidates with probabilities, $P\gt 0.5$ . We then removed all candidates with $S\lt 5$ . Although the choice of this threshold is somewhat arbitrary, it corresponds to $\gtrsim 50 \%$ of a candidate's ${\rho }_{p}$ –R_p domain overlapping with that of the known planet detections.⁸ We also verified that the four recovered transiting hot Jupiters from Section 4.4 had S integral measures greater than this threshold.

Figure 16 shows the results of the binary star filtering (Section 5.1) and the physical plausibility vetting (Section 5.2) on the aggregate of candidates with $P\gt 0.5$ . These filters clearly have a dramatic effect on the number of plausible planetary phase curve candidates.

**Figure 16.** Top panel: probability vs. amplitude for the candidates with $P\gt 0.5$ (top half of Figure 13). Bottom panel: remaining candidates after the filters from Sections 5.1 and 5.2 were applied.
Download figure:
Standard image High-resolution image

6. Results: A Catalog of Candidate Planets with Phase Curves

In the remainder of the paper, we focus our analysis on the subset of highest probability candidates. Figures 9 and 12 show that true phase curves should be reported with probabilities very close to 1. To this end, we selected all candidates with $P\gt 0.97$ and $\mathrm{Amp}\lt 70\,\mathrm{ppm}$ . Last, we established a final candidate list through a small amount of visual inspection. In particular, we inspected the Lomb–Scargle periodograms and removed any that showed signs of significant peaks apart from the primary peak. We also examined the $2P$ -folded phase curves and checked for any subtle indications of amplitude inconsistency in each half.

Following this final vetting procedure, we present 60 remaining high-probability candidates. In Table 2 we list their parameters. The parameters K_p, ${T}_{\mathrm{eff}}$ , R_⋆, M_⋆, and Fe/H are from the Q1-Q17 DR 25 Kepler stellar catalog (Mathur et al. 2017). The uncertainties on the periods were calculated using a Gaussian fit to the peak in the Lomb–Scargle periodogram. The units of ${A}_{\mathrm{refl}}$ and ${A}_{\mathrm{ellip}}$ are ppm.

We also report estimates of the candidates' minimum RV semi-amplitudes, $K\sin i$ . To calculate this, we note that the amplitude of the ellipsoidal variation component, ${A}_{\mathrm{ellip}}$ , obtained via Equation (9), is directly related to K through the relation

$\begin{eqnarray}&&K\sin i=\displaystyle \frac{{A}_{\mathrm{ellip}}}{{\alpha }_{2}}{\left(\displaystyle \frac{{{GM}}_{\star }}{a}\right)}^{1/2}{\left(\displaystyle \frac{a}{{R}_{\star }}\right)}^{3}.\end{eqnarray} \tag{ 27 }$

The error bars on $K\sin i$ were obtained via propagation of the uncertainties on the KIC stellar parameters and the ${A}_{\mathrm{ellip}}$ estimates obtained from the least-squares fit introduced in the first paragraph of Section 5.2. Unfortunately, these uncertainties are quite large, mostly due to the lack of constraint on the stellar parameters.

An online repository of the candidates may be found at https://smillholland.github.io/Non-transiting_HJs/. The repository contains a downloadable candidate catalog and a variety of diagnostic plots, including the candidate phase curves, periodograms, and ${\rho }_{p}-{R}_{p}$ diagrams. We also provide information relevant to follow-up observations, including estimates of the orbital ephemerides.

6.1. Albedo Constraints

Under the assumption that the majority of remaining high-probability candidates correspond to true planets, we are presented with a collection of phase curves that may be examined in greater detail. One quantity of interest is the planetary albedo. Without additional information, it is impossible to distinguish the degeneracy between the albedo, planet radius, and inclination. However, we can consider the composite quantity, ${A}_{g}{{R}_{p}}^{2}\sin i$ , which can be calculated directly from the observable quantities as follows:

$\begin{eqnarray}&&{A}_{g}{{R}_{p}}^{2}\sin i=2{A}_{\mathrm{refl}}{\left(\displaystyle \frac{{{GM}}_{\star }}{4{\pi }^{2}}\right)}^{2/3}{P}^{4/3}.\end{eqnarray} \tag{ 28 }$

Here, P and ${A}_{\mathrm{refl}}$ are the period and reflection semi-amplitude of the candidate phase curve, and M_⋆ is taken from the Kepler stellar catalog.

In Figure 17 we display the results of the ${A}_{g}{{R}_{p}}^{2}\sin i$ calculations. In the top panel, we show the histogram of ${A}_{g}{{R}_{p}}^{2}\sin i$ for the candidates in blue. We also show the histograms of all synthetic injections and the recovered synthetic injections from Section 3. The ratio of these two gives a recovery efficiency curve, and in the bottom panel, we use this efficiency curve to construct the bias-corrected histogram for the candidates. We compare these results to 12 known transiting hot Jupiters with phase curves from Esteves et al. (2015). It is clear from the comparison of the bias-corrected candidate CDF to the transiting hot Jupiter CDF that the candidates are a good match to known planets. The candidate CDF is slightly shifted toward lower values, which is an expected outcome of the bias correction. Furthermore, the bias-corrected histogram peaks at ${A}_{g}{{R}_{p}}^{2}\sin i\sim 0.1$ , providing further evidence that hot Jupiters are generally fairly dark.

**Figure 17.** Constraints on the candidates' albedo through the quantity ${A}_{g}{\left(\tfrac{{R}_{p}}{1.3{R}_{\mathrm{Jup}}}\right)}^{2}\sin i$ . The top panel contains histograms for the injected and recovered synthetic data and the candidates, where the candidate histogram has been normalized to be on the same scale as the synthetic data histograms. The bottom panel contains the bias-corrected candidate histogram and cumulative distribution function. Both panels show the values for known transiting planets with phase curves from Esteves et al. (2015), and the bottom panel shows the corresponding CDFs.
Download figure:
Standard image High-resolution image

It is interesting to check whether the candidates show any systematic trends in albedo with other quantities. In Figure 18 we plot ${A}_{g}{{R}_{p}}^{2}\sin i$ versus ${T}_{\mathrm{irr}}$ , where ${T}_{\mathrm{irr}}$ is defined as

$\begin{eqnarray}&&{T}_{\mathrm{irr}}={T}_{\mathrm{eff}}{\left(\displaystyle \frac{{R}_{\star }}{a}\right)}^{1/2},\end{eqnarray} \tag{ 29 }$

such that the incident stellar irradiation is ${F}_{0}=\sigma {T}_{\mathrm{irr}}^{4}$ (Perna et al. 2012). ${T}_{\mathrm{irr}}$ is not the planet's equilibrium temperature, which is rather given by

$\begin{eqnarray}&&{T}_{\mathrm{eq}}={T}_{\mathrm{eff}}{\left(\displaystyle \frac{{R}_{\star }}{a}\right)}^{1/2}{[f(1-{A}_{B})]}^{1/4},\end{eqnarray} \tag{ 30 }$

where A_B is the Bond albedo and f is a reradiation factor ranging between 1/4 (homogeneous redistribution) and 2/3 (instant reradiation) (Esteves et al. 2013). However, ${T}_{\mathrm{irr}}$ is a close proxy to ${T}_{\mathrm{eq}}$ .

The gray gradient in Figure 18 illustrates regions with stronger detection bias. This was calculated by finding the point density of recovered synthetic injections and dividing it by the point density of all synthetic injections.

There is little obvious trend in ${A}_{g}{{R}_{p}}^{2}\sin i$ versus ${T}_{\mathrm{irr}}$ . There is a weak indication that the average ${A}_{g}{{R}_{p}}^{2}\sin i$ decreases as a function of ${T}_{\mathrm{irr}}$ up to ${T}_{\mathrm{irr}}\sim 2700\ {\rm{K}}$ , at which point the phase curve would start involving substantial thermal radiation, and A_g would not just include reflected light. This may suggest that cooler planets are brighter on average, with a greater proportion of reflective clouds. It is not surprising to see the lack of a strong relationship, however. We are using Kepler stellar catalog parameters, so ${T}_{\mathrm{irr}}$ is not known to high accuracy. It would be valuable to revisit this after more precise stellar parameters have been obtained. Data from the Gaia spacecraft (Gaia Collaboration et al. 2016) may be a near-term resource for these improved estimates.

6.2. Offsets of the Phase Curve Maxima

A phase offset between the phase curve maximum and substellar point has been observed for most planetary phase curves. The maximum is shifted eastward for thermal phase curves (Showman & Guillot 2002; Knutson et al. 2007), as superrotating jets advect the hottest region downwind. Both eastward and westward shifts have been observed for optical phase curves (Esteves et al. 2015; Shporer & Hu 2015).

For phase curves of transiting planets, the phase offset is easily determined by comparing the phase curve maximum to the secondary eclipse. Determining ${\rm{\Delta }}\theta$ is not easy for non-transiting planets, but it is possible in some cases. If beaming and/or ellipsoidal components of the light curve are detected, the phase offset can be obtained from the phase relationship between the reflected light and beaming/ellipsoidal components.

Toward this goal, we used the three-component phase curve fits (Equation (22)) of the candidates from Section 5.2. We calculated the phase offset between the reflection maximum and substellar point using ${\rm{\Delta }}\theta ={\theta }_{\mathrm{refl}}-{\theta }_{\mathrm{ellip}}$ . For the candidate pictured in Figure 14, the phase offset is consistent with an eastward shift of the phase curve maximum.

For each of the candidate phase curve fits, we also calculated the differences in the Aikaike information criterion (AIC) between the one-component and three-component fits. If the three-component fit is favored, then ${\rm{\Delta }}\mathrm{AIC}$ is large and the ellipsoidal/beaming components are more significant. In Figure 19 we plot ${\rm{\Delta }}\theta$ versus ${\rm{\Delta }}\mathrm{AIC}$ and ${T}_{\mathrm{irr}}$ . ${\rm{\Delta }}\theta \gt 0$ corresponds to eastward shifts in the phase curve maxima (i.e., the phase curve peaks before the substellar point), while ${\rm{\Delta }}\theta \lt 0$ corresponds to westward shifts.

From the left panel in Figure 19, we see that candidates with smaller ${\rm{\Delta }}\mathrm{AIC}$ favor westward shifts in the phase curve maxima, whereas the maxima of candidates with larger ${\rm{\Delta }}\mathrm{AIC}$ tend to be eastward shifted. Moreover, the right panel provides weak evidence for a positive correlation of ${\rm{\Delta }}\theta$ with ${T}_{\mathrm{irr}}$ . Both of these correlations strengthen appreciably when considering candidates with higher probability thresholds (e.g., $P\gt 0.985$ , rather than the 0.97 used here). We emphasize again that we are using Kepler stellar catalog parameters in the calculation of ${T}_{\mathrm{irr}}$ , which likely explains much of the scatter in Figure 19. It will be valuable to see how the results improve with more reliable spectroscopic and astroseismic stellar parameter estimates.

It is logical that ${\rm{\Delta }}\mathrm{AIC}$ and ${T}_{\mathrm{irr}}$ should show a similarly signed correlation with ${\rm{\Delta }}\theta$ , as ${\rm{\Delta }}\mathrm{AIC}$ is closely related to ${T}_{\mathrm{irr}}$ . As ${\rm{\Delta }}\mathrm{AIC}$ increases, the ellipsoidal component of the light curve becomes more significant. Among other dependencies, this correlates with a small a or large R_⋆ (see Equation (9)), both of which would act to increase ${T}_{\mathrm{irr}}$ (Equation (29)).

Several previous studies have also observed a positive trend in ${\rm{\Delta }}\theta$ versus ${T}_{\mathrm{irr}}$ or ${T}_{\mathrm{eq}}$ (Angerhausen et al. 2015; Esteves et al. 2015; Shporer & Hu 2015; Shporer 2017). Optical phase curves contain contributions from both reflected light and thermal radiation. For the hottest planets, the phase curve contains a significant thermal contribution, and the eastward phase shift matches that which is observed in IR phase curves. This phase shift is thought to be due to a superrotating equatorial jet advecting the hot spot eastward of the substellar point (e.g., Showman & Guillot 2002). For the coolest planets, the phase curve is reflection dominated, and the westward phase shift has been attributed to the presence of reflective clouds condensing in the cooler regions of the planet, westward of the substellar point (e.g., Shporer 2017). The linear correlation in ${\rm{\Delta }}\theta$ versus ${T}_{\mathrm{irr}}$ could then just arise from the relative contribution of each phase curve component.

The presence of both thermal and reflection components in the phase curve makes it challenging to draw physical interpretations of the atmospheric dynamics. Considering the hot spot alone, an increase in ${T}_{\mathrm{irr}}$ should result in a decrease in the magnitude of the phase offset, as heat redistribution becomes less efficient (Perna et al. 2012). This arises from the balance between radiative and advective timescales, ${t}_{\mathrm{rad}}$ and ${t}_{\mathrm{adv}}$ . With

$\begin{eqnarray}&&{t}_{\mathrm{rad}}=\displaystyle \frac{{c}_{P}P}{{\sigma }_{\mathrm{SB}}\ {g}_{p}\ {T}_{\max }^{3}}\end{eqnarray} \tag{ 31 }$

and

$\begin{eqnarray}&&{t}_{\mathrm{adv}}=\displaystyle \frac{R}{{v}_{\theta }},\end{eqnarray} \tag{ 32 }$

${t}_{\mathrm{adv}}/{t}_{\mathrm{rad}}$ increases rapidly with ${T}_{\mathrm{irr}}$ , such that the heat redistribution efficiency and the hot-spot offset both decrease with ${T}_{\mathrm{irr}}$ (Perna et al. 2012).

Ohmic dissipation, which becomes significant at large ${T}_{\mathrm{irr}}$ as the consequence of an increase in thermal ionization and the resulting electrical conductivity, should also act to slow the zonal winds and reduce the magnitude of the hot-spot offset (Perna et al. 2010a, 2010b; Batygin et al. 2013). This effect works constructively with the balance between ${t}_{\mathrm{adv}}$ and ${t}_{\mathrm{rad}}$ described above.

In order to see these theorized trends of ${\rm{\Delta }}\theta$ versus ${T}_{\mathrm{irr}}$ in the data, it would be necessary to isolate the thermal component from the reflection component of the phase curve (e.g., Armstrong et al. 2016). This would be a valuable undertaking when the candidates presented in this paper are confirmed and the planets' irradiation temperatures may be determined more accurately.

7. Discussion

7.1. Caveats and Future Work

There is an important caveat related to the identification and characterization of non-transiting planet candidates presented in this paper. The concern relates to the effects of nearby stellar companions. Using high-resolution optical and near-IR imaging, Furlan et al. (2017) found that $\sim 30 \%$ of KOI host stars have at least one companion star within 4''. Since the Kepler detector has ∼4'' wide pixels and the aperture photometry is obtained from at least a few pixels, this implies that $\gtrsim 30 \%$ of the candidates presented here will have signal amplitudes that are smaller than they would otherwise be without nearby companions.

In addition to influencing the constraints on the candidates' albedo, planet radius, and inclination (see Section 6.1), this photometric contamination can also result in false positive phase curve detections in the form of background or foreground binary star phase curves, analogous to the dominant transit false positive sources (e.g., Morton et al. 2016). Although the binary star filtering and physical plausibility vetting outlined in Sections 5.1 and 5.2 has certainly helped reduce some of these false positives, we do not claim to have fully accounted for the problem.

It might be possible to further reduce the effects of background or foreground binary stars by including them in the synthetic data set that was first produced in Section 3.2. To do this, one could first assume a distribution of angular separations and magnitude differences of nearby stellar companions using the results of Furlan et al. (2017). Then, the distribution of short-period binary stars could be empirically modeled after the findings of Kirk et al. (2016) and other papers in this series. After injecting signals appropriate to these distributions, one could search for predictors to add to the logistic regression that would help distinguish the binary signals from the planetary phase curve signals.

7.2. Candidate Follow-up Observations

The main result of this paper is a catalog of 60 non-transiting short-period giant planet candidates with phase curves (presented in Section 6). These candidates can be confirmed with follow-up radial velocity (RV) observations. Even though the target stars are dim, hot Jupiters are readily detectable even with low RV precision. Moreover, the known orbital ephemerides from the phase curves will significantly aid the RV measurements. When the RVs confirm a planet candidate's existence, the comparison between the RV and photometric ephemerides will allow for a more precise measurement of the shift between the phase curve maximum and substellar point. Moreover, the combined RV photometric modeling could permit a break in the mass/inclination degeneracy.

After RV confirmation, other valuable follow-up observations include Spitzer infrared photometry of the brightest targets. There are very few planets with both optical and infrared phase curve observations (Christiansen et al. 2010; Demory et al. 2013). The combination of these observations, and particularly a comparison of the offset of the phase curve maxima, would be strongly constraining on the planets' atmospheric dynamics.

7.3. Candidate Stellar Metallicities and Effective Temperatures

It is well known that the occurrence rate of giant planets correlates strongly with stellar metallicity (Santos et al. 2004; Fischer & Valenti 2005; Johnson et al. 2010). Consideration of the stellar metallicity of the highest probability candidates thus provides a strong independent check on their validity; the distribution of candidate metallicities should be systematically higher than the metallicity distribution of the remaining population.

The metallicities reported in the Kepler stellar catalog have been primarily photometrically derived (Brown et al. 2011) and have also been shown to systematically underestimate the true metallicity and scatter (Dong et al. 2014; Petigura et al. 2017). Dong et al. (2014) used LAMOST spectroscopic data to show that the Kepler metallicities are best fit by the relation

$\begin{eqnarray}&&{[\mathrm{Fe}/{\rm{H}}]}_{\mathrm{KIC}}=-0.20+0.43{[\mathrm{Fe}/{\rm{H}}]}_{\mathrm{LAMOST}},\end{eqnarray} \tag{ 33 }$

with the relation most secure in the range $-0.3\,\lt {[\mathrm{Fe}/{\rm{H}}]}_{\mathrm{LAMOST}}\lt 0.4$ . It is thus necessary to interpret any analysis of Kepler metallicities with caution. Nevertheless, it should still be meaningful to compare the statistical distribution of candidate metallicities to the remaining population.

In Figure 20 we display the metallicity histograms and cumulative distribution functions of all FGK stars in gray and of the candidate stars above certain probability thresholds in blue and green. Clearly, the metallicities of the candidates are systematically higher than the metallicities of the remaining FGK stars. The candidates above the $P\gt 0.985$ threshold are shifted further, indicating a smaller amount of false positive contamination as P increases.

An identical analysis may be performed for ${T}_{\mathrm{eff}}$ , since the occurrence of hot Jupiters drops off sharply for ${T}_{\mathrm{eff}}\lesssim 5000$ K. As shown in Figure 20, ${T}_{\mathrm{eff}}$ is systematically high for the candidates, which should be expected if the candidates are true giant planets.

7.4. KOI Host Stars

Our analysis has focused on the collection of Kepler FGK stars without known planets or planet candidates. Although the search for non-transiting hot Jupiters around Kepler transiting planet hosts was the original motivation of this work (Millholland et al. 2016), here we have intentionally excluded them. In KOI systems, the detection of astrometrically induced transit-timing variations for the transiting planet may serve as an additional predictor for the existence of a non-transiting giant planet. These targets are thus best handled separately, and we will be addressing them in future work.

8. Conclusion

The prospects of using phase curves to detect non-transiting Jupiter-mass planets has been an outstanding problem for some time (Faigler & Mazeh 2011; Placek et al. 2014; Shporer & Hu 2015; Shporer 2017). In this paper, we constructed a supervised learning algorithm for the detection of non-transiting Kepler planets with optical phase curves. Our algorithm relies on exploiting the time-dependent properties of planetary phase curves, namely the fact that phase curves are more temporally consistent in their amplitudes and phase compared to other types of light curve variability.

We first developed a pipeline to identify candidate phase curve signals in Kepler light curves. We demonstrated the pipeline's recovery efficiency using light curves containing synthetically injected phase curves. The phase curves recovered by the pipeline exhibited significantly different properties from "non-phase curves," that is, signals where the pipeline's detection did not match the injected phase curve. We then developed a logistic regression algorithm in a supervised learning context to classify phase curves and non-phase curves. The algorithm performs exceedingly well in its ability to predict phase curves in synthetic data sets and the set of Kepler transiting hot Jupiters.

We applied our algorithm to the full set of Kepler FGK stars without known planets or planetary candidates and identified 60 high-probability planet candidates. We examined trends in the candidates' albedos and phase offsets and discussed physical explanations for the observed trends. All of our candidates are available for inspection at https://smillholland.github.io/Non-transiting_HJs/.

For more than two decades, the characterization of hot Jupiters has been consistently improved via careful observational study from both the ground and space. Yet many zeroth-order aspects of these bizarre worlds—including their energetics, their interior structures, and their origins—remain poorly understood. Aided by the supervised machine learning detection of optical phase curves, a substantial augmentation to the known population of hot Jupiters might be obtained. Planetary candidates identified in this manner can be readily verified with sparsely sampled Doppler velocity observations. These measurements will not only enable the confirmation of new planets, but will also enhance the analysis of the phase curves and aid in the understanding of the planets' atmospheric dynamics. We are particularly optimistic that the techniques outlined here can be refined and extended to forthcoming space-based photometric surveys, including, notably, the TESS and Plato Missions.

We are indebted to the referee, Jon Jenkins, whose expertise in the subject and thorough review of the manuscript helped facilitate significant improvements. S.M. is supported by the National Science Foundation Graduate Research Fellowship Program under Grant Number DGE-1122492. This material is also based upon work supported by the National Aeronautics and Space Administration through the NASA Astrobiology Institute under Cooperative Agreement Notice NNH13ZDA017C issued through the Science Mission Directorate. We acknowledge support from the NASA Astrobiology Institute through a cooperative agreement between NASA Ames Research Center and Yale University. This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program. This work has also made use of the Exoplanet Orbit Database and the Exoplanet Data Explorer at exoplanets.org.

Supervised Learning Detection of Sixty Non-transiting Hot Jupiter Candidates

Article metrics

Permissions

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. An Automated Detection Pipeline

2.1. Step 1—Identifying Candidate Periods

2.2. Step 2—Phase Curve Fitting

2.3. Step 3—Quantifying Residual Autocorrelations

3. Synthetic Phase Curve Injection and Recovery

3.1. Phase Curve Model

3.2. Light Curve Injection

3.3. Recovery Efficiency

4. Binary Classification Using Logistic Regression

4.1. Logistic Regression Overview

4.2. Phase Curve Classification Using Logistic Regression

4.3. Testing on Synthetic Phase Curves

4.4. Testing on Transiting Kepler Hot Jupiters

5. Application to Kepler FGK Stars

5.1. Binary Star Filtering

5.2. Further Vetting via Physical Plausibility Considerations

6. Results: A Catalog of Candidate Planets with Phase Curves

6.1. Albedo Constraints

6.2. Offsets of the Phase Curve Maxima

7. Discussion

7.1. Caveats and Future Work

7.2. Candidate Follow-up Observations

7.3. Candidate Stellar Metallicities and Effective Temperatures

7.4. KOI Host Stars

8. Conclusion

Footnotes

Supervised Learning Detection of Sixty Non-transiting Hot Jupiter Candidates

Article metrics

Permissions

Share this article

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. An Automated Detection Pipeline

2.1. Step 1—Identifying Candidate Periods

2.2. Step 2—Phase Curve Fitting

2.3. Step 3—Quantifying Residual Autocorrelations

3. Synthetic Phase Curve Injection and Recovery

3.1. Phase Curve Model

3.2. Light Curve Injection

3.3. Recovery Efficiency

4. Binary Classification Using Logistic Regression

4.1. Logistic Regression Overview

4.2. Phase Curve Classification Using Logistic Regression

4.3. Testing on Synthetic Phase Curves

4.4. Testing on Transiting Kepler Hot Jupiters

5. Application to Kepler FGK Stars

5.1. Binary Star Filtering

5.2. Further Vetting via Physical Plausibility Considerations

6. Results: A Catalog of Candidate Planets with Phase Curves

6.1. Albedo Constraints

6.2. Offsets of the Phase Curve Maxima

7. Discussion

7.1. Caveats and Future Work

7.2. Candidate Follow-up Observations

7.3. Candidate Stellar Metallicities and Effective Temperatures

7.4. KOI Host Stars

8. Conclusion

Footnotes